Abstract. Anomalies persist in the foundations of ridge regression as set forth in Hoerl
and Kennard (1970) and subsequently. Conventional ridge estimators and their properties
do not follow on constraining lengths of solution vectors using LaGrange's method,
as claimed. Estimators so constrained have singular distributions; the proposed solutions
are not necessarily minimizing; and heretofore undiscovered bounds are exhibited
for the ridge parameter. None of the considerable literature on estimation, prediction,
cross-validation, choice of ridge parameter, and related issues, collectively known as ridge
regression, is consistent with constrained optimization, nor with corresponding inequality
constraints. The problem is traced to a misapplication of LaGrange's principle, failure
to recognize the singularity of distributions, and misplaced links between constraints and
the ridge parameter. Other principles, based on condition numbers, are seen to validate
both conventional ridge and surrogate ridge regression to be defined. Numerical studies
illustrate that ridge analysis often exhibits some of the same pathologies it is intended
to redress.



1. Introduction 

Given the full-rank model $Y = X\beta + \epsilon$ with zero-mean, homoscedastic, and uncorrelated
errors, the ordinary least squares (OLS) estimators $\hat\beta_L$ solve the $k$ equations $X'X\beta = X'Y$
on minimizing $Q(\beta) = (Y - X\beta)'(Y - X\beta)$. Ill-conditioned models long have posed special
challenges, in that $\hat\beta_L$ often exhibits excessive length, inflated variances, instability, and
other intrinsic difficulties. Noting these, Hoerl (1962, 1964) considered ad hoc solutions
$\{\hat\beta_{R_\lambda} = (X'X + \lambda I_k)^{-1}X'Y;\ \lambda > 0\}$ and noted their successful applications in chemical
engineering. Analyses built around these have been labeled ridge regression in statistics,
although Levenberg (1944) and Riley (1955) earlier posed such solutions in numerical analysis.
Noting that OLS "does not have built into it a method for portraying sensitivity of
the solutions to the estimation criterion," Hoerl and Kennard (1970) sought mathematical
foundations beyond Gauss's principle with its inherent limitations. Specifically, they asserted
that $\hat\beta_{R_\lambda}$ are solutions minimizing $Q(\beta)$ subject to the constraint $\{\beta'\beta = c^2\}$. Others
identify ridge regression instead with the constraints $\{\beta'\beta \le c^2\}$ of Balakrishnan (1963);
however, Hoerl and Kennard (1970), p. 64, specifically relegate this to approaches other
than ridge regression.
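
As an illustrative sketch, not part of the original development, the OLS equations and Hoerl's ridge equations can be solved numerically as follows. The simulated data, the function names `ols` and `ridge`, and the value of the ridge parameter are hypothetical choices made only for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS solution of X'X b = X'y (X assumed to have full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge solution of (X'X + lam * I_k) b = X'y for lam >= 0."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Hypothetical ill-conditioned design: two nearly collinear columns.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=50)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=50)

b_ols = ols(X, y)               # typically long and unstable here
b_ridge = ridge(X, y, lam=0.1)  # shorter solution vector
```

Here `ridge(X, y, 0.0)` reproduces the OLS solution, and the ridge solution for a positive parameter is shorter than the OLS solution.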
Ridge estimators abound, based on estimative, predictive, cross-validative, and numerous
other criteria, typically giving disparate choices for $\lambda$. Even the early simulations of
Dempster, Schatzoff, and Wermuth (1977) identified 57 ridge and related shrinkage estimators.
An expository survey and numerical examples are provided in Myers (1990). In short, a
considerable literature, spanning the past thirty-six years, rests on the foundations of Hoerl
and Kennard (1970), ostensibly the mathematics of constrained optimization, to remedy
defects of OLS in ill-conditioned systems.

In fact, little of the collective literature known as ridge regression is consistent with the
constrained optimization of Hoerl and Kennard (1970), nor with corresponding inequality
constraints. Here the problem is traced to (i) a misapplication of LaGrange's principle, (ii)
failure to identify singular distributions, and (iii) invalid links between the constraints and
the ridge parameters. These errors are evident also in Marquardt (1970), Marquardt and
Snee (1975), Golub, Heath and Wahba (1979), van Nostrand (1980), and elsewhere throughout
the literature. In consequence, much that is known about ridge regression rests on a
false premise. By analogy, Hoerl and Kennard (1970) considered generalized ridge regression
as solving the modified equations $(X'X + \Lambda)\beta = X'Y$, with nonnegative ridge parameters
$\Lambda = \mathrm{Diag}(\lambda_1, \ldots, \lambda_k)$. As noted later, these solutions again are inconsistent with LaGrange
minimization. In summary, not to denigrate its usefulness in practice, the collective body
of ridge regression rests on little more than heuristics. To the contrary, aspects of ridge
regression have proven useful enough, often enough, to deserve sound rationale for their
implementation. In this spirit we seek to supplant the missing foundations with alternatives
based on conditioning of the linear system $X'X\beta = X'Y$. An outline follows.

Supporting developments comprise Section 2, to include notation and the basics of invariance
and condition numbers. Section 3 reexamines LaGrange optimization in linear
inference. Section 4 develops supporting rationale for ridge regression as currently practiced,
and an alternative approach using surrogate ridge models. A case study in Section 5
revisits an ill-conditioned data set considered elsewhere. Section 6 concludes with a brief
summary.

2. Preliminaries 

2.1. Notation. The symbols $\mathbb{R}^k$ and $\mathbb{R}^k_+$ designate Euclidean $k$-space and its positive
orthant; $\mathbb{F}_{n\times k}$ and $\mathbb{F}^{*}_{n\times k}$ comprise the real and complex $(n \times k)$ matrices of rank $k \le n$; and
$\mathbb{S}_k$ and $\mathbb{S}_k^{+}$ designate the real symmetric $(k \times k)$ matrices and their positive definite
varieties. The transpose, inverse, trace, and determinant of $A \in \mathbb{F}_{k\times k}$ are $A'$, $A^{-1}$, $\mathrm{tr}(A)$, and
$|A|$, and $V^*$ is the conjugate transpose of $V \in \mathbb{F}^{*}_{n\times k}$. Groups of note include $U(k)$ as the
unitary $(k \times k)$ matrices, and $O(k)$ as the real orthogonal group. Special arrays are the
$(k \times k)$ identity $I_k$, the unit vector $1_k = [1, 1, \ldots, 1]' \in \mathbb{R}^k$, and the diagonal matrix
$D_a = \mathrm{Diag}(a_1, \ldots, a_k)$. The mapping $\sigma(X) = [\xi_1, \ldots, \xi_k]'$ takes $X \in \mathbb{F}_{n\times k}$ into its ordered singular
values $\{\xi_1 \ge \cdots \ge \xi_k > 0\}$. The singular decomposition is $X = UDV^*$, such that
$D = \mathrm{Diag}(D_\xi, 0)$ of order $(n \times k)$, $D_\xi = \mathrm{Diag}(\xi_1, \ldots, \xi_k)$, $U \in U(n)$, and $V \in U(k)$, where the
columns of $U = [u_1, \ldots, u_n]$ and $V = [v_1, \ldots, v_k]$ comprise the left- and right-singular
vectors of $X$. Equivalently write $X = U_1 D_\xi V^* = \sum_{i=1}^{k} \xi_i u_i v_i^*$ with $U_1 = [u_1, \ldots, u_k]$,
and its Moore-Penrose inverse as $X^\dagger = V D^\dagger U^*$, with $D^\dagger = \mathrm{Diag}(D_\xi^{-1}, 0)$ of order $(k \times n)$.
Specifically, the real model $Y = X\beta + \epsilon$ in canonical form becomes $Y = P D_\xi \theta + \epsilon$, where
$X = P D_\xi Q'$ and $\theta = Q'\beta$ is an orthogonal reparametrization. For $Y \in \mathbb{R}^n$ random,
designate its mean vector, its dispersion and correlation matrices as $E(Y) = \mu$, $V(Y) = \Sigma$,
and $C(Y) = R$, and its law of distribution as $\mathcal{L}(Y)$.

2.2. Invariance and Conditioning. A function $\psi(\cdot)$ on $\mathbb{F}^{*}_{n\times k}$ is called unitarily invariant
if, for each $G \in \mathbb{F}^{*}_{n\times k}$ and any unitary matrices $U \in U(n)$ and $V \in U(k)$, it follows that
$\psi(G) = \psi(UGV^*)$. Then $\psi(G)$ depends on $G$ only through its ordered singular values
$\sigma(G) = [\gamma_1, \ldots, \gamma_k]'$. Let $\Phi$ comprise the symmetric gauge functions on $\mathbb{R}^k$ such that for
each $\phi(\cdot) \in \Phi$, (i) $\phi(u_1, \ldots, u_k)$ is symmetric under the $2^k k!$ permutations and reflections
about the origin; (ii) $\phi(u) > 0$ when $u \ne 0$; (iii) $\phi$ is homogeneous, i.e., $\phi(cu) = |c|\,\phi(u)$
for $c \ne 0$; and (iv) $\phi(u + v) \le \phi(u) + \phi(v)$. Let $\Psi$ comprise the unitarily invariant matrix
norms on $\mathbb{F}^{*}_{n\times k}$; von Neumann (1937) demonstrated that these are generated as $\{\|\cdot\|_\phi;\ \phi \in \Phi\}$
with $\|G\|_\phi = \phi(\gamma_1, \ldots, \gamma_k)$. Corresponding norms on $\mathbb{F}_{n\times k}$ are invariant under $X \to PXQ'$,
with $(P, Q) \in O(n) \times O(k)$; see also Schatten (1970) and Marshall and Olkin (1979). In
particular, the Frobenius norm on $\mathbb{F}_{n\times k}$ is $\|X\|_F = [\mathrm{tr}(X'X)]^{1/2} = (\sum_{i=1}^{k} \xi_i^2)^{1/2}$ in terms of
the singular decomposition $X = P D_\xi Q'$, with $\|\cdot\|$ as the Euclidean norm on $\mathbb{R}^k$.
Two types of conditioning are germane to the present study: 

Type A Conditioning: Stability of the solution $z$ of the linear system $Az = b$, when the
coefficients $A \in \mathbb{F}_{k\times k}$ are subjected to small perturbations, is gauged by the condition number
$c_g(A) = g(A)\,g(A^{-1})$, where $g(\cdot)$ ordinarily is a norm. The system is well conditioned at
$A = I_k$ with $c_g(I_k) = 1.0$, larger values reflecting greater ill-conditioning. Specifically, with
$g(A) = \|A\|_\phi$, then $\{c_\phi(\cdot);\ \phi \in \Phi\}$ comprise the unitarily invariant Type A condition numbers,
so that $\{c_\phi(A) = \|A\|_\phi \|A^{-1}\|_\phi;\ \phi \in \Phi\}$, as treated in Marshall and Olkin (1979), Horn and
Johnson (1985), and elsewhere. In particular, take $c_1(A) = \alpha_1/\alpha_k$, where $\{\alpha_1 \ge \cdots \ge \alpha_k\}$
are the ordered eigenvalues of $A$.

Type B Conditioning: The concept of elasticities is invoked in Belsley, Kuh and Welsch
(1980) to link sensitivities of solutions, and of variances of $\hat\beta_L = (Z'Z)^{-1}Z'Y$, with
disturbances in the data matrix $Z \in \mathbb{F}_{n\times k}$, as gauged by its condition number $c_1(Z) = \xi_1/\xi_k$,
with $\sigma(Z) = [\xi_1, \ldots, \xi_k]'$. Here $Z$ is the result of scaling the columns of $X$ to have
(approximately) equal lengths. More generally, the unitarily invariant condition numbers on $\mathbb{F}_{n\times k}$
are $c_\phi(X) = \|X\|_\phi \|X^\dagger\|_\phi$, with $X^\dagger$ as the Moore-Penrose inverse. The system is well
conditioned at $X = P I_k Q'$, where $c_\phi(P I_k Q') = c_\phi(I_k) = 1.0$, larger values reflecting greater
ill-conditioning. In summary, Belsley et al. (1980) proceed to scale the columns of $X \to Z$
to have approximately equal lengths, and to focus on $\|Z\|_{\phi_1} = \xi_1$, so that $c_1(Z) = \xi_1/\xi_k$.
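
As an illustrative sketch, the condition number $c_1(\cdot)$ of both types may be computed from singular values. The function names and the small design matrix below are hypothetical, and the columns are scaled to unit length as one way to give them equal lengths.

```python
import numpy as np

def c1(M):
    """Condition number c_1(M) = largest / smallest singular value of M."""
    s = np.linalg.svd(M, compute_uv=False)  # returned in descending order
    return s[0] / s[-1]

def scale_columns(X):
    """Scale the columns of X to unit Euclidean length (X -> Z)."""
    return X / np.linalg.norm(X, axis=0)

# Hypothetical design with two nearly collinear columns.
X = np.array([[1.0, 1.000],
              [1.0, 1.001],
              [1.0, 0.999],
              [1.0, 1.002]])
Z = scale_columns(X)

c1_Z = c1(Z)          # Type B: conditioning of the data matrix Z
c1_ZtZ = c1(Z.T @ Z)  # Type A: conditioning of the system Z'Z b = Z'Y
```

For symmetric positive definite $Z'Z$ the singular values coincide with the eigenvalues, so that `c1_ZtZ` is the square of `c1_Z` up to rounding.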
3. The Principal Issues

3.1. LaGrange's Method. Given differentiable functions $f(u_1, \ldots, u_k)$ and $g(u_1, \ldots, u_k)$
such that the gradient $\nabla g(u_1, \ldots, u_k) \ne 0$ on $G_0 = \{u \in \mathbb{R}^k : g(u) = 0\}$, the problem is to
minimize $f(u_1, \ldots, u_k)$ subject to the constraint $g(u_1, \ldots, u_k) = 0$. Write
$L(u_1, \ldots, u_k, \lambda) = f(u_1, \ldots, u_k) + \lambda[g(u_1, \ldots, u_k) - 0]$. It is necessary that gradient
vectors in $\mathbb{R}^k$ be parallel, i.e.,

$\nabla f(u_1, \ldots, u_k) = \lambda \nabla g(u_1, \ldots, u_k)$, (3.1)

whereas

$\partial L(u_1, \ldots, u_k, \lambda)/\partial\lambda = [g(u_1, \ldots, u_k) - 0]$ (3.2)

recovers the constraint. LaGrange's principle requires solving the $k + 1$ equations, (3.1)
and (3.2), in the $k + 1$ unknowns $\{u_1, \ldots, u_k, \lambda\}$. To minimize $f(u_1, \ldots, u_k)$ subject to
$g(u_1, \ldots, u_k) \ge 0$, define the Lagrangian $L(u, \lambda) = f(u) - \lambda g(u)$. Stuezle$^{1}$ has given
conditions for $u^*$ to be a solution, namely, (i) $g(u^*) \ge 0$; (ii) $\nabla_u L(u^*, \lambda^*) = 0$; (iii) $\lambda^* g(u^*) = 0$;
and (iv) $\lambda^* \ge 0$.

For constrained least squares the objective function is now

$L(\beta_1, \ldots, \beta_k, \lambda) = Q(\beta) + \lambda(\beta'\beta - c^2)$

with $Q(\beta) = (Y - X\beta)'(Y - X\beta)$ as before. Corresponding to (3.1) and (3.2) are

$(X'X + \lambda I_k)\beta = X'Y$ (3.3)

$\beta'\beta = c^2$ (3.4)

to be solved for the $k + 1$ unknowns $(\beta_1, \ldots, \beta_k, \lambda)$. Designate these as $\{\hat\beta_c, \lambda\}$ such that
$\hat\beta_c'\hat\beta_c = c^2$, as apparently intended by Hoerl and Kennard (1970). If instead $Q(\beta)$ is to
be minimized subject to $\{\beta'\beta \le c^2\}$, then the constrained solution $\hat\beta_0$ satisfies $\hat\beta_0 = \hat\beta_L$
whenever $\hat\beta_L'\hat\beta_L \le c^2$, and otherwise $(X'X + \lambda I_k)\hat\beta_0 = X'Y$ for some $\lambda > 0$ such that
$\hat\beta_0'\hat\beta_0 = c^2$, as shown in Balakrishnan (1963). See also the conditions (i)-(iv) of Stuezle
(2005) as cited.
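
The inequality-constrained characterization just stated can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not an algorithm from the sources cited: the function names are hypothetical, the search is restricted to $\lambda \ge 0$ (where the length of the solution of (3.3) does not increase with $\lambda$), and bisection is one convenient root finder among many.

```python
import numpy as np

def ridge(X, y, lam):
    """Solution of (X'X + lam * I_k) b = X'y, i.e. equation (3.3)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def constrained_ls(X, y, c, tol=1e-12, max_iter=200):
    """Minimize Q(b) subject to b'b <= c^2; returns (b, lam)."""
    if c <= 0:
        raise ValueError("radius c must be positive")
    b_ols = ridge(X, y, 0.0)
    if np.linalg.norm(b_ols) <= c:
        return b_ols, 0.0                  # constraint inactive: OLS solves it
    lo, hi = 0.0, 1.0
    while np.linalg.norm(ridge(X, y, hi)) > c:
        hi *= 2.0                          # expand until the solution is inside
    for _ in range(max_iter):              # bisection on lam in [lo, hi]
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(ridge(X, y, mid)) > c:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * max(1.0, hi):
            break
    lam = 0.5 * (lo + hi)
    return ridge(X, y, lam), lam

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=20)
b_c, lam_c = constrained_ls(X, y, c=1.0)
```

When the constraint is active the returned pair satisfies both (3.3) and (3.4); when the OLS solution already lies within the ball, the multiplier is zero, in keeping with conditions (i)-(iv).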

3.2. Ridge Regression: A Survey. We recall essentials of conventional ridge regression
as set forth principally in Hoerl and Kennard (1970), Marquardt (1970), and Marquardt and
Snee (1975). For continuity we retain their notation, with their $\{k, p, \hat\beta, \hat\beta^*\}$ corresponding
to our $\{\lambda, k, \hat\beta_L, \hat\beta_{R_\lambda}\}$ and, on occasion, $\hat\beta_{R_\lambda} = \hat\beta^* = \hat\beta^*(\lambda)$. Accordingly, write the residual
sum of squares as $(Y - Xb)'(Y - Xb) = \phi_{\min} + \phi(b)$, where $\phi_{\min} = (Y - X\hat\beta)'(Y - X\hat\beta)$
and $\phi(b) = (b - \hat\beta)'X'X(b - \hat\beta)$. Various assertions have been set forth, as enumerated here
for later reference.

A1. Hoerl and Kennard (1970), p. 57: "$\hat\beta^* = [I_p + k(X'X)^{-1}]^{-1}\hat\beta$ (2.3)*."

A2. Hoerl and Kennard (1970), pp. 58-59: "The ridge trace can be shown to be
following a path through the sums of squares surface so that for a fixed $\phi$ a single value for
b is chosen and that is the one with minimal length." Precisely: "Minimize $b'b$ subject to
$(b - \hat\beta)'X'X(b - \hat\beta) = \phi_0$ (3.2)*." "This reduces to $b = \hat\beta^* = (X'X + kI)^{-1}X'Y$ where k
is chosen to satisfy the restraint (3.2)*."

$^{1}$Stuezle, W., "Chapter 5. Notes on Ridge Regression," online notes for BioStat 538, Winter 2005,
University of Washington, at www.stat.washington.edu/wxs/Stat538-w05

A3. Hoerl and Kennard (1970), p. 59: "Of course, in practice it is easier to choose a
$k \ge 0$ and then to compute $\phi_0$. In terms of $\hat\beta^*$ the residual sum of squares becomes
$\phi^*(k) = (Y - X\hat\beta^*)'(Y - X\hat\beta^*) = \phi_{\min} + k^2\hat\beta^{*\prime}(X'X)^{-1}\hat\beta^*$ (3.6)*."

A4. Hoerl and Kennard (1970), p. 59: "A completely equivalent statement of the
problem is this: If the squared length of the regression vector b is fixed at $R^2$, then $\hat\beta^*$ is
the value of b that gives a minimum sum of squares. That is, $\hat\beta^*$ is the value of b that
minimizes the function $F_1 = (Y - Xb)'(Y - Xb) + (1/k)(b'b - R^2)$ (3.7)*."

A5. Marquardt and Snee (1975), p. 5: "If $\hat\beta^*$ is the solution of $(X'X + kI)\hat\beta^* = g$, then
$\hat\beta^*$ minimizes the sum of squares of residuals on the sphere centered at the origin whose
radius is the length of $\hat\beta^*$." Here $g = X'Y$.

3.3. Properties of Solutions. We next examine distributions of the constrained solutions
$\hat\beta_c$, subject to $\hat\beta_c'\hat\beta_c = c^2$, to continue the unfinished work of Hoerl and Kennard (1970), and of
$\hat\beta_0$ under inequality constraints. To these ends identify the sphere $S_c = \{u \in \mathbb{R}^k : u'u = c^2\}$
and the open ball $B_c = \{u \in \mathbb{R}^k : u'u < c^2\}$, both of radius $c$, and the complement
$B_c' = \{u \in \mathbb{R}^k : u'u \ge c^2\}$. Accordingly, let $\mu(\cdot)$ be the probability measure on $\mathbb{R}^k$ induced
through $Y \to \hat\beta_L$; let $\mu_{S_c}(\cdot)$ be the measure on $S_c \subset \mathbb{R}^k$ induced through solutions $\hat\beta_c$ of
(3.3) and (3.4); and let $\mu_{B_c}(\cdot)$ be the nonsingular measure on $B_c \subset \mathbb{R}^k$ induced through
$\mathcal{L}(\hat\beta_L \mid \hat\beta_L'\hat\beta_L < c^2)$. Stochastic properties of $\hat\beta_c$ and $\hat\beta_0$ are given next.

Theorem 1. Let $\hat\beta_c \in \mathbb{R}^k$ be the constrained solution satisfying (3.3) and (3.4), and let
$\hat\beta_0 \in \mathbb{R}^k$ minimize $Q(\beta_1, \ldots, \beta_k)$ subject to $\{\beta'\beta \le c^2\}$, with $\mu_0(\cdot)$ as its probability measure
on $\mathbb{R}^k$.

(i) The joint distribution $\mathcal{L}(\hat\beta_c) = F_c(b)$ corresponding to $\mu_{S_c}(\cdot)$ is singular on $\mathbb{R}^k$ of rank
$k - 1$.

(ii) The measure $\mu_0(\cdot)$ for $\hat\beta_0$ admits the mixture representation

$\mu_0(A) = \alpha \cdot \mu_{B_c}(A) + \bar\alpha \cdot \mu_{S_c}(A)$ (3.5)

with mixing probabilities $\bar\alpha = 1 - \alpha \in (0, 1)$, such that

(iii) $\mu_{B_c}(A) = [\mu(B_c)]^{-1} \int_A I_{B_c}(t)\,d\mu(t)$, where $I_{B_c}(t)$ is the indicator function; and

(iv) $\alpha = \mu(B_c)$.

Proof: Conclusion (i) is immediate, since $\hat\beta_c \in \mathbb{R}^k$ constructively lies on the sphere
$\hat\beta_c'\hat\beta_c = c^2$. We proceed by conditioning on the exclusive outcomes $\hat\beta_L \in B_c$ and $\hat\beta_L \in B_c'$.
Clearly $\hat\beta_0$ takes the value $\hat\beta_L$ with probability $\alpha = P(\hat\beta_L'\hat\beta_L < c^2) = \mu(B_c)$, where the
conditional measure corresponding to $\mathcal{L}(\hat\beta_L \mid \hat\beta_L'\hat\beta_L < c^2)$ is
$\mu_{B_c}(A) = [\mu(B_c)]^{-1} \int_A I_{B_c}(t)\,d\mu(t)$, as asserted, to give conclusion (iii). Similarly,
$\hat\beta_0$ takes the value $\hat\beta_c$ with probability $\bar\alpha = 1 - \alpha$ as in
(iv), its conditional measure as in (i), to complete our proof. □
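
The mixture structure of Theorem 1 can be visualized by simulation. The following is a minimal sketch under assumptions of our own choosing: Gaussian errors, a hypothetical design and parameter vector, a search restricted to $\lambda \ge 0$, and bisection in canonical coordinates to place the constrained solution on the sphere. The name `beta0` and all numerical settings are illustrative only.

```python
import numpy as np

def beta0(X, y, c, n_iter=100):
    """Inequality-constrained solution; also reports whether OLS lay inside B_c."""
    P, xi, Qt = np.linalg.svd(X, full_matrices=False)
    u = P.T @ y
    theta = lambda lam: xi * u / (xi**2 + lam)   # canonical solution of (3.3)
    if np.linalg.norm(theta(0.0)) < c:
        return Qt.T @ theta(0.0), True           # OLS lies in the open ball
    lo, hi = 0.0, 1.0
    while np.linalg.norm(theta(hi)) > c:
        hi *= 2.0
    for _ in range(n_iter):                      # bisection for the multiplier
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(theta(mid)) > c else (lo, mid)
    return Qt.T @ theta(hi), False               # solution on the sphere S_c

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
beta, sigma, c = np.array([1.0, 0.5, -0.5]), 1.0, 1.25

draws = [beta0(X, X @ beta + sigma * rng.normal(size=30), c) for _ in range(500)]
inside = np.array([flag for _, flag in draws])
lengths = np.array([np.linalg.norm(b) for b, _ in draws])
alpha_hat = inside.mean()   # Monte Carlo estimate of alpha = mu(B_c)
```

In such a simulation a fraction of roughly `alpha_hat` of the replicates fall inside the ball, while every remaining replicate has length exactly $c$ up to rounding, illustrating the singular component on $S_c$.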

Observe that the singular distribution $\mathcal{L}(\hat\beta_c)$ of conclusion (i) may be added to the list
of distributions arising in the analysis of directional data, to include the von Mises-Fisher
distributions, for example. For further reference see Batschelet (1981), Fisher (1993), Fisher,
Lewis and Embleton (1993), Evans, Hastings and Peacock (2000), and Mardia and Jupp
(2000). Conclusion (ii) for $\hat\beta_0$ complements the work of Balakrishnan (1963) in the context
of linear estimation. Moreover, under Gaussian errors, $\alpha = \mu(B_c)$ derives from a weighted
sum of $k$ independent noncentral chi-squared random variables, each having a single degree
of freedom; see Kotz, Johnson and Boyd (1967).

3.4. A Critique. We next reexamine the critical assertions of Section 3.2. 

Assertion A1: False. As noted, the solution $\hat\beta_c$ necessarily lies on the sphere $\hat\beta_c'\hat\beta_c = c^2$
and thus has a joint singular distribution in $\mathbb{R}^k$ of rank $k - 1$. To the contrary, Assertion
A1 implies that $\hat\beta^*(\lambda)$ has a nonsingular distribution for each $\lambda > 0$, yet $\hat\beta^*$ clearly refers to
the constrained solution throughout Section 3 of Hoerl and Kennard (1970). The assertion
is false, applying to solutions of (3.3) only, as there is no one-to-one linear transformation
taking $\hat\beta_L$ onto the sphere $\beta'\beta = c^2$. In consequence, expression (3.6)* of Hoerl and Kennard
(1970) is in error, as are its implications, since the term $k^2\hat\beta^{*\prime}(X'X)^{-1}\hat\beta^*$ derives from the
inapplicable Assertion A1.

Assertion A2, and its dual A4, appear to be essentially intact. The exception is that
"$1/k$" in expression (3.7)* of Hoerl and Kennard (1970) instead should be "$k$."

Assertion A5: False. This assertion arises as the dual to A3, excluding (3.6)* of Hoerl
and Kennard (1970). The basic idea is to solve (3.3) as $\hat\beta^*(\lambda)$ for fixed $\lambda > 0$, and then to
discover the implied constraint $\{\beta'\beta = c_*^2\}$ at (3.4) on evaluating $\hat\beta^{*\prime}\hat\beta^* = c_*^2$. However, the
solution $\hat\beta^*(\lambda)$ need not minimize the residual sum of squares
$SS(\lambda) = [Y - X\hat\beta^*(\lambda)]'[Y - X\hat\beta^*(\lambda)]$, as claimed. This fallacy stems from the tacit but
unfounded assumption that $\lambda$ and $c$ correspond one-to-one. To the contrary, it is demonstrated
in Section 5 that multiple solutions may have the same length but different $\lambda$s, for example,
$\|\hat\beta^*(\lambda_1)\| = \|\hat\beta^*(\lambda_2)\|$ with $\lambda_1 < \lambda_2$. But then the solution $\hat\beta^*(\lambda_2)$ cannot be
minimizing, as $SS(\lambda_2) > SS(\lambda_1)$ from the
monotonicity of $SS(\lambda)$. In this regard Figure 3 of Marquardt and Snee (1975) is particularly
misleading. Assertions A2, "for a fixed $\phi$ a single value for b is chosen and that is the one
with minimal length," and A5, that "$\hat\beta^*$ minimizes the sum of squares of residuals on the
sphere centered at the origin whose radius is the length of $\hat\beta^*$," often are misrepresented as
equivalent assertions regarding solutions $\hat\beta_{R_\lambda}$ of (3.3) alone. See van Nostrand (1980), for
example.

To continue, for fixed $c$ define the equivalence class

$\Lambda(c) = \{\lambda : \|\hat\beta^*(\lambda)\| = c\}$, (3.6)

and let $\lambda_c = \min\{\Lambda(c)\}$. Then Assertion A5 may be corrected as follows.

Assertion A5*. If $\hat\beta^*(\lambda)$ is a solution of $(X'X + \lambda I)\hat\beta^* = X'Y$ having length $\|\hat\beta^*(\lambda)\| = c_*$,
then $\hat\beta^*(\lambda_*)$ minimizes the sum of squares of residuals on the sphere centered at the origin
whose radius is the length $c_*$ of $\hat\beta^*$, where $\lambda_* = \min\{\Lambda(c_*)\}$.

Assertion A5* has profound consequences in practice. Of the many schemes devised for
choosing the ridge parameter $\lambda$, the user then must examine the corresponding equivalence
class for each such $\lambda$. If it is a singleton set, then the solution thus attained is minimizing.
Otherwise the algorithm A5* must be implemented to attain the minimizing solution.
Further details are provided in Section 5.3.

It is clear that $\hat\beta_c$ is the LaGrange solution minimizing $Q(\beta)$ subject to $\{\beta'\beta = c^2\}$. To
the contrary, Hoerl and Kennard (1970), Marquardt (1970), Marquardt and Snee (1975),
Golub et al. (1979), and others concerned with constrained optimization, instead take $\hat\beta_{R_\lambda}$
as the ridge estimator, solving (3.3) alone for some $\lambda > 0$. Together with Assertion A5, this
is tantamount to asserting that the $k$ linear equations (3.3) somehow embody the constraint
(3.4) as well, which they clearly cannot. Yet $\hat\beta_{R_\lambda}$, not $\hat\beta_c$, comprise the ridge estimators on
which essentially all of ridge regression now rests. Assertion A1 clearly holds for solutions
$\hat\beta_{R_\lambda}$ satisfying (3.3) only.

Confusion persists in the meaning of ridge regression. Bunke (1975), Hocking (1976),
and Tibshirani (1996), for example, assert that ridge regression embodies the inequality
constraint $\{\beta'\beta \le c^2\}$, despite the disclaimer of Hoerl and Kennard (1970). Yet nowhere
do these authors acknowledge the constrained solution $\hat\beta_0$ of Balakrishnan (1963), nor its
properties as in Theorem 1, opting instead for the ridge solutions $\{\hat\beta_{R_\lambda};\ \lambda > 0\}$ of Hoerl
(1962, 1964). On the other hand, the inequality-constrained solution $\hat\beta_0$ does have the
nonsingular mixture distribution of Theorem 1. However, we are aware of no work in ridge
regression that explicitly accounts for the structure of either $\hat\beta_c$ or of $\hat\beta_0$ as in Theorem 1.

In short, ridge regression in its present form rests essentially on $\hat\beta_{R_\lambda}$ through an accident of
history. Indeed, expressions for variances and biases; solutions for $\lambda$ purporting to minimize
expected mean squares; prediction, validation, and cross-validation; and other aspects of
ridge regression; all are predicated on Assertion A1. If instead either $\hat\beta_c$ or $\hat\beta_0$ were taken
as starting points, as required under the aegis of constrained optimization, then the ensuing
"ridge regressions" would differ dramatically from the conventional one based on
$\{\hat\beta_{R_\lambda};\ \lambda > 0\}$, together with the critical but false Assertion A1, with $c$ now corresponding to $\lambda$. These
differences necessarily would include issues such as (i) the stability of the solutions $\hat\beta_c$ or $\hat\beta_0$
instead of $\hat\beta_{R_\lambda}$, in comparison with $\hat\beta_L$; (ii) the inflation of variances, taking into account
actual variances to be derived from Theorem 1 as reference; (iii) prediction using $\hat Y_c = X\hat\beta_c$
or $\hat Y_0 = X\hat\beta_0$ instead of $\hat Y_{R_\lambda} = X\hat\beta_{R_\lambda}$; (iv) the use, meaning, and properties of
cross-validative and predictive criteria based on $\hat Y_c$ or $\hat Y_0$, instead of $\hat Y_{R_\lambda}$; (v) ridge traces as
modified to take into account $\hat\beta_0$ and singularity of the joint distribution of $\hat\beta_c$; and (vi) the
trade-off between bias and variance of the constrained estimators $\hat\beta_c$ and $\hat\beta_0$, as determined
using actual moments to be derived from Theorem 1. Other differences may be noted.
All such properties would have to be established anew, complicated considerably by the
nonstandard distributions encountered in Theorem 1.

By analogy, Hoerl and Kennard (1970) further considered generalized ridge regression
invoking the $k$ equations $(X'X + \Lambda)\beta = X'Y$, with $\Lambda = \mathrm{Diag}(\lambda_1, \ldots, \lambda_k)$ as nonnegative
ridge parameters. Note that this, too, cannot have resulted from LaGrange minimization:
Given that $\{\beta_1^2 = c_1^2, \ldots, \beta_k^2 = c_k^2\}$, the only function of the data now would be to determine
signs of the roots $\{\beta_1 = \pm c_1, \ldots, \beta_k = \pm c_k\}$. On the other hand, if inequality constraints
$\{\beta_1^2 \le c_1^2, \ldots, \beta_k^2 \le c_k^2\}$ are invoked instead, then correct solutions are provided by Myoken
and Uchida (1977) akin to those of Balakrishnan (1963) where $\{\lambda_1 = \cdots = \lambda_k = \lambda\}$.

4. Foundations Via Conditioning 

We seek substitutes for the failed principle of constrained optimization as a basis for
conventional ridge regression. In what follows we consider $\{\hat\beta_{R_\lambda};\ \lambda > 0\}$ as solutions to (3.3)
alone as in Hoerl (1962, 1964), without reference to constrained optimization and discredited
assertions thereto as noted. Type A conditioning of the linear system $X'X\beta = X'Y$
prompts the modification $(X'X + \lambda I_k)\beta = X'Y$, from the perspective of both numerical
analysis (Levenberg (1944) and Riley (1955)) and of statistics (Hoerl (1962, 1964)). A
survey is provided subsequently. Moreover, the Type B conditioning of $Y = X\beta + \epsilon$ is also
germane, since the conditioning of $X'X$ depends on that of $X$, and for further reasons to
be cited. A new approach to ill-conditioned systems, using surrogate ridge models, rests
essentially on Type B conditioning. Details follow.

4.1. Background. Ill-conditioned models typically arise from nonorthogonality of columns
of $X$. Let $W = X'X$ and $V = (X'X)^{-1}$. Since $V(\hat\beta_L) = \sigma^2 V$, the variance inflation factors
(VIFs) of $\hat\beta_L = [\hat\beta_{L_1}, \ldots, \hat\beta_{L_k}]'$ are defined as $\{VIF(\hat\beta_{L_j}) = v_{jj}/w_{jj}^{-1};\ 1 \le j \le k\}$, i.e., the ratio
of the actual variance to the "ideal" variance attained when columns of $X$ are orthogonal, so
that $W = \mathrm{Diag}(w_{11}, \ldots, w_{kk})$. Often $Y = Z\beta + \epsilon$ is taken with $Z'Z$ in "correlation form"
having unit diagonal elements; then $\{VIF(\hat\beta_{L_j}) = v_{jj};\ 1 \le j \le k\}$ are diagonal elements
of $V = (Z'Z)^{-1}$ from the scale-invariance of VIFs. With $\{V_1 \ge V_2 \ge \cdots \ge V_k\}$ as the
ordered diagonal elements of $V$, Marquardt and Snee (1975) identify $V_1$ to be "the best
single measure of the conditioning of the data," thus a critical diagnostic tool. See also
Marquardt (1970), Beaton, Rubin and Barone (1976), and Davies and Hutton (1975). A
basic connection between VIFs and condition numbers is due to Berk (1977):

Lemma 1. Given $Z'Z$ in correlation form, with $\{V_1 \ge V_2 \ge \cdots \ge V_k\}$ as the ordered diagonal
elements of $V = (Z'Z)^{-1}$. Then the condition number $c_1(Z'Z)$ satisfies

$V_1 \le c_1(Z'Z) \le k(V_1 + \cdots + V_k)$. (4.1)

Since $\{c_\phi(A) = c_\phi(A^{-1});\ \phi \in \Phi\}$ from Section 2.2, the Type A condition number for
$Z'Z\beta = Z'Y$ is identical to $c_\phi[V(\hat\beta_L)]$, so that Lemma 1 is really about dispersion parameters in the
equivalent form

$V_1 \le c_1[V(\hat\beta_L)] \le k(V_1 + \cdots + V_k)$. (4.2)
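
As an illustrative check of the bounds (4.1), the following sketch computes VIFs and the condition number for a hypothetical design. The data are simulated, the function name is our own, and centering the columns before scaling is an assumption about how "correlation form" is obtained.

```python
import numpy as np

def correlation_form(X):
    """Center and scale the columns of X so that Z'Z has unit diagonal."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

# Hypothetical design: first two columns strongly correlated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=40), rng.normal(size=40)])

Z = correlation_form(X)
ZtZ = Z.T @ Z
V = np.linalg.inv(ZtZ)

vifs = np.sort(np.diag(V))[::-1]   # ordered VIFs: V_1 >= ... >= V_k
alpha = np.linalg.eigvalsh(ZtZ)    # eigenvalues in ascending order
c1 = alpha[-1] / alpha[0]          # c_1(Z'Z) = alpha_1 / alpha_k
k = ZtZ.shape[0]
lower, upper = vifs[0], k * vifs.sum()
```

For this design `lower <= c1 <= upper`, as Lemma 1 requires; the gap between the bounds indicates how loosely the largest VIF alone determines the condition number.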

4.2. Ridge Regression. That $X'X \to (X'X + \lambda I_k)$ improves conditioning has been cited
by Marshall and Olkin (1979) as a justification for ridge regression. In brief, their Theorem
C.3, p. 273, asserts that for any $(A, B) \in \mathbb{S}_k^{+}$ such that $c_\phi(B) \le c_\phi(A)$, with $\{c_\phi(\cdot);\ \phi \in \Phi\}$
as in Section 2.2, then $c_\phi(A + B) \le c_\phi(A)$. Riley (1955) showed that $B = \lambda I_k$ satisfies the
hypothesis of the theorem for any $A \in \mathbb{S}_k^{+}$, where $\lambda$ depends on numerical considerations.
This holds for any Type A conditioning of $Az = b \to (A + \lambda I_k)z = b$ as in Section 2.2, and
thus in particular for $X'X\beta = X'Y \to (X'X + \lambda I_k)\beta = X'Y$, as noted by Marshall and
Olkin (1979), p. 273, to give Type A conditioning as a basis for ridge regression. Moreover,
using condition numbers $c_1(\cdot)$, the improvement is seen directly on comparing
$c_1(X'X) = \xi_1^2/\xi_k^2$ with $c_1(X'X + \lambda I_k) = (\xi_1^2 + \lambda)/(\xi_k^2 + \lambda)$, where $\sigma(X) = [\xi_1, \ldots, \xi_k]'$. Essential
properties of $\hat\beta_L$ and $\hat\beta_{R_\lambda}$ are summarized in Table 1, along with the surrogate estimator,
$\hat\beta_{S_\lambda}$, to be defined subsequently.

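Table 1. Properties of $\{\hat\beta_L, \hat\beta_{R_\lambda}, \hat\beta_{S_\lambda}\}$ under Gauss-Markov assumptions,
where $X_\lambda = P\,\mathrm{Diag}(\sqrt{\xi_1^2 + \lambda}, \ldots, \sqrt{\xi_k^2 + \lambda})\,Q'$ and $A_\lambda = (X'X + \lambda I_k)$.

Estimator                 Definition                      $E(\hat\beta)$                          $V(\hat\beta)$
$\hat\beta_L$             $(X'X)^{-1}X'Y$                 $\beta$                                 $\sigma^2 (X'X)^{-1}$
$\hat\beta_{R_\lambda}$   $A_\lambda^{-1}X'Y$             $A_\lambda^{-1}X'X\beta$                $\sigma^2 A_\lambda^{-1}X'XA_\lambda^{-1}$
$\hat\beta_{S_\lambda}$   $A_\lambda^{-1}X_\lambda'Y$     $A_\lambda^{-1}X_\lambda'X\beta$        $\sigma^2 A_\lambda^{-1}$
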


4.3. Surrogate Models. Nonetheless, the correspondence $Az = b \leftrightarrow X'X\beta = X'Y$
is incomplete in the context of linear inference, since both $A = X'X$ and $b = X'Y$ are
subject to disturbances in $X$. This has not been taken into account. In particular, ridge
solutions satisfying $(X'X + \lambda I_k)\beta = X'Y$, despite improved conditioning on the left, still
are subject to the ill-conditioning of $X$ on the right. To correct this oversight, we invoke
Type B conditioning from Section 2.2 on observing that $X'X \to (X'X + \lambda I_k)$ is tantamount
to modifying $X$ itself as a means to enhanced conditioning. In particular, begin with the
singular decomposition $X = P D_\xi Q'$; let $X_\lambda = P\,\mathrm{Diag}(\sqrt{\xi_1^2 + \lambda}, \ldots, \sqrt{\xi_k^2 + \lambda})\,Q'$; observe
that $(X'X + \lambda I_k) = X_\lambda'X_\lambda$; and note that ridge regression entails $X_\lambda'X_\lambda\beta = X'Y$. Instead,
we take $Y = X_\lambda\beta + \epsilon$ as an approximation, or surrogate, for the ill-conditioned model
$Y = X\beta + \epsilon$ itself, as in the following.

Definition 1. Given an ill-conditioned model $Y = X\beta + \epsilon$, its ridge surrogate is a modified
model $Y = X_\lambda\beta + \epsilon$. The surrogate estimator $\hat\beta_{S_\lambda}$, solving $X_\lambda'X_\lambda\beta = X_\lambda'Y$, is OLS for the
surrogate model.

To continue, the order of approximation of $X_\lambda$ for $X$ may be gauged by the Frobenius
distance

$\|X - X_\lambda\|_F = \left[\sum_{i=1}^{k} \left(\sqrt{\xi_i^2 + \lambda} - \xi_i\right)^2\right]^{1/2}$ (4.3)

from the unitary invariance of $\|\cdot\|_F$. Moreover, the conditioning of $X_\lambda'X_\lambda\beta = X_\lambda'Y$ now
may be gauged through Type B conditioning as in Section 2.2. For later reference, basic
properties of $\{\hat\beta_{S_\lambda};\ \lambda > 0\}$ are summarized in Table 1. It remains to compare properties
of $\{\hat\beta_L, \hat\beta_{R_\lambda}, \hat\beta_{S_\lambda}\}$. Direct comparisons are somewhat obscure; however, these become more
transparent on invoking canonical forms to be considered next.
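
The surrogate construction of Definition 1 and the distance (4.3) can be sketched as follows. This is a minimal illustration with simulated data; the function names and the choice $\lambda = 0.5$ are hypothetical.

```python
import numpy as np

def surrogate_design(X, lam):
    """X_lam = P Diag(sqrt(xi_i^2 + lam)) Q', from the SVD X = P D_xi Q'."""
    P, xi, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.sqrt(xi**2 + lam)) @ Qt

def surrogate_estimator(X, y, lam):
    """OLS in the surrogate model Y = X_lam beta + e."""
    X_lam = surrogate_design(X, lam)
    return np.linalg.solve(X_lam.T @ X_lam, X_lam.T @ y)

# Hypothetical data for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)
lam = 0.5

X_lam = surrogate_design(X, lam)
b_surr = surrogate_estimator(X, y, lam)

# Frobenius distance computed directly and from the singular values, as in (4.3).
xi = np.linalg.svd(X, compute_uv=False)
frob_direct = np.linalg.norm(X - X_lam)
frob_formula = np.sqrt(np.sum((np.sqrt(xi**2 + lam) - xi) ** 2))
```

The identity $X_\lambda'X_\lambda = X'X + \lambda I_k$ can be confirmed numerically from `X_lam`, and the two Frobenius distances agree up to rounding.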
4.4. Canonical Forms. The singular decomposition $X = P D_\xi Q'$, with $P'P = I_k$, together
with the orthogonal reparametrization $\theta = Q'\beta$, gives $Y = X\beta + \epsilon \to Y = P D_\xi Q'\beta + \epsilon
\to U = P'Y = D_\xi\theta + P'\epsilon$, such that $E(P'\epsilon) = 0$ and $V(P'\epsilon) = \sigma^2 P'I_nP
= \sigma^2 I_k$ under Gauss-Markov assumptions regarding the errors of $Y = X\beta + \epsilon$. Accordingly,
$E(U) = D_\xi\theta$ and $V(U) = \sigma^2 I_k$. In canonical form it follows that
$\hat\theta_L = (D_\xi^2)^{-1}D_\xi U = D_\xi^{-1}U$, $E(\hat\theta_L) = \theta$, and $V(\hat\theta_L) = \sigma^2 D_\xi^{-2}$ under OLS, as given in Table 2. Similar expressions
for the canonical ridge estimators $\{\hat\theta_{R_\lambda};\ \lambda > 0\}$, and the canonical surrogate ridge estimators
$\{\hat\theta_{S_\lambda};\ \lambda > 0\}$, are reported in Table 2. Since $\hat\beta = Q\hat\theta$, $E(\hat\beta) = QE(\hat\theta)$, and $V(\hat\beta) = QV(\hat\theta)Q'$
for all three estimators, Table 1 follows directly from Table 2, and conversely. Moreover,
issues regarding the conditioning of $\{\hat\beta_L, \hat\beta_{R_\lambda}, \hat\beta_{S_\lambda}\}$, as linear data transformations, and
conditioning of the corresponding dispersion matrices $\{V(\hat\beta_L), V(\hat\beta_{R_\lambda}), V(\hat\beta_{S_\lambda})\}$, are considered
subsequently. These can be established directly in terms of those of $\{\hat\theta_L, \hat\theta_{R_\lambda}, \hat\theta_{S_\lambda}\}$, since $Q$
is orthogonal and condition numbers here are unitarily invariant.



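Table 2. Properties of $\{\hat\theta_L, \hat\theta_{R_\lambda}, \hat\theta_{S_\lambda}\}$ under standard Gauss-Markov
assumptions, where $U = P'Y$ and $D(\omega_i) = \mathrm{Diag}(\omega_1, \ldots, \omega_k)$.

Estimator                  Definition                              $E(\hat\theta)$                               $V(\hat\theta)$
$\hat\theta_L$             $D(1/\xi_i)U$                           $\theta$                                      $\sigma^2 D(1/\xi_i^2)$
$\hat\theta_{R_\lambda}$   $D(\xi_i/(\xi_i^2 + \lambda))U$         $D(\xi_i^2/(\xi_i^2 + \lambda))\theta$        $\sigma^2 D(\xi_i^2/(\xi_i^2 + \lambda)^2)$
$\hat\theta_{S_\lambda}$   $D(1/\sqrt{\xi_i^2 + \lambda})U$        $D(\xi_i/\sqrt{\xi_i^2 + \lambda})\theta$     $\sigma^2 D(1/(\xi_i^2 + \lambda))$
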



Specifically, in canonical form we have $D_\xi\hat\theta_L = U$, so that the Type B condition number
$c_1(X) = c_1(D_\xi) = \xi_1/\xi_k$ properly gauges the sensitivity of the solution $\hat\theta_L$ to disturbances
in $X$. Similarly, observe from $\mathrm{Diag}((\xi_1^2 + \lambda)/\xi_1, \ldots, (\xi_k^2 + \lambda)/\xi_k)\,\hat\theta_{R_\lambda} = U$
that the condition number of this diagonal array gauges sensitivity of the solution $\hat\theta_{R_\lambda}$, and thus of
$\hat\beta_{R_\lambda} = Q\hat\theta_{R_\lambda}$, to perturbations in $X$, from the orthogonality of $Q$. This underscores the central role of Type
B conditioning from Section 2.2, as set forth in Belsley et al. (1980).

4.5. Central Issues. Several issues, to be examined empirically in Section 5, appear to be
open questions not addressed in the voluminous literature on ridge regression. Intrinsic
difficulties with OLS include (i) nonorthogonality of the columns of $X$, as reflected in $c_\phi(X)$ and
$c_\phi(X'X)$; (ii) instability of solutions linked to the conditioning of the data transformation
$\hat\beta_L(Y) = (X'X)^{-1}X'Y$, considered as a function of $Y$; and (iii) pathologies in dispersion
parameters as reflected in VIFs and the ill-conditioning of $V(\hat\beta_L)$. Moreover, at some level
the conditioning of $E(\hat\beta_{R_\lambda}) = T(\beta)$ becomes an issue in transforming the parameter space,
as in assessing the trade-off between variance and bias. As ridge regression seeks remedies,
it is pertinent to ask how well the ridge solutions progress towards those ends. Regarding
item (i), the apparent "correlations" in $W = X'X$, namely $\{w_{ij}/\sqrt{w_{ii}w_{jj}}\}$, are taken into
$\{w_{ij}/\sqrt{(w_{ii} + \lambda)(w_{jj} + \lambda)}\}$ as elements of $(X'X + \lambda I_k)$. These in turn decrease in magnitude
with increasing $\lambda$. Nonetheless, ridge solutions themselves are subject to nonorthogonality,
together with attendant difficulties regarding stability, VIFs, and conditioning of their
dispersion matrices. Improving stability of the solutions thus hinges on the conditioning of
$\hat\beta_{R_\lambda}(Y)$ when considered as a data transformation. Moreover, the capacity to ameliorate
dispersion problems of OLS hinges on improving VIFs and condition numbers for $V(\hat\beta_{R_\lambda})$.
On the other hand, it is widely known that $\hat\beta_{R_\lambda}$ shrinks stochastically towards the origin, as
do its mean and dispersion matrix, with increasing $\lambda$. These issues in turn prompt several
questions to be considered subsequently.

Q1: Does it follow that stability of $\hat\beta_{R_\lambda}(Y)$ necessarily improves with increasing $\lambda$?
Q2: Given that $V(\hat\beta_{R_\lambda}) = \sigma^2(X'X + \lambda I_k)^{-1}X'X(X'X + \lambda I_k)^{-1}$, does it follow that
condition numbers $c_1[V(\hat\beta_{R_\lambda})]$ decrease with increasing $\lambda$?
Q3: With regard to variance inflation, does it follow that VIFs for elements of $\hat\beta_{R_\lambda}$
decrease with increasing $\lambda$?
Q4: Viewing $E(\hat\beta_{R_\lambda}) = T(\beta)$ as a transformation on the space of parameters, does it
follow that its conditioning improves with increasing $\lambda$?

For completeness, observe that the foregoing issues pertain not only to the ridge estimators
$\{\hat\beta_{R_\lambda};\ \lambda > 0\}$ themselves, but also to other biased solutions to include $\{\hat\beta_{S_\lambda};\ \lambda > 0\}$.

We next undertake a comparative study of properties of ridge and surrogate ridge solutions,
to be continued in the case studies of Section 5.

4.6. Some Comparisons. Regarding the conventional $\{\hat\beta_{R_\lambda}; \lambda \ge 0\}$ and surrogate ridge
$\{\hat\beta_{S_\lambda}; \lambda \ge 0\}$ estimators, both shrink stochastically towards the origin with increasing $\lambda$, as
do their means and variances, and similarly for $\{\hat\theta_{R_\lambda}; \lambda \ge 0\}$ and $\{\hat\theta_{S_\lambda}; \lambda \ge 0\}$. Specifically,
for a given $\lambda$, it is seen from Table 2 that $\hat\theta_{S_\lambda}$ achieves lesser shrinkage, both in expectation
and variance, than $\hat\theta_{R_\lambda}$.

Condition numbers for various arrays are given in Table 3 for the canonical estimators
$\{\hat\theta_L, \hat\theta_{R_\lambda}, \hat\theta_{S_\lambda}\}$. These arrays include (i) coefficients defining $\hat\theta(U)$ with reference to
stability of the solutions; (ii) coefficients defining the parameter transformations
$E(\hat\theta) = T(\theta)$; and (iii) the dispersion matrix $V(\hat\theta)$. Entries in Table 3 follow directly from
Table 2 and the definition of $c_1(\cdot)$, on recalling that elements of $D_\xi = Diag(\xi_1, \ldots, \xi_k)$ are
ordered as $\{\xi_1 \ge \cdots \ge \xi_k > 0\}$. Observe, moreover, that the rows of Table 3 may be identified
equivalently as $\{\hat\beta_L, \hat\beta_{R_\lambda}, \hat\beta_{S_\lambda}\}$, and the columns as $\{c_1[\hat\beta(Y)], c_1[T(\beta)], c_1[V(\hat\beta)]\}$,
respectively. This follows since $\beta = Q\theta$, $\hat\beta = Q\hat\theta$, and $V(\hat\beta) = QV(\hat\theta)Q'$, where $Q$ is orthogonal
and the condition numbers are unitarily invariant.
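
The invariance just invoked can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original computations; the diagonal matrix `v_theta` and the orthogonal matrix `q` are hypothetical stand-ins for $V(\hat\theta)$ and $Q$, and `c1` is assumed here to be the ratio of the largest to the smallest singular value.

```python
import numpy as np

def c1(a):
    """Condition number taken as largest over smallest singular value."""
    s = np.linalg.svd(a, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

rng = np.random.default_rng(0)
k = 5

# Hypothetical canonical dispersion matrix V(theta_hat): diagonal and positive.
v_theta = np.diag(rng.uniform(0.01, 10.0, size=k))

# Hypothetical orthogonal Q, taken from the QR factorization of a random matrix.
q, _ = np.linalg.qr(rng.standard_normal((k, k)))

# Dispersion in the original coordinates: V(beta_hat) = Q V(theta_hat) Q'.
v_beta = q @ v_theta @ q.T

print(c1(v_theta), c1(v_beta))  # the two values agree up to rounding
```

Under these assumptions the rotation by `q` leaves the singular values unchanged, which is why the canonical entries of Table 3 serve equally for the estimators in the original coordinates.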

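Note further that $c_1[\hat\beta_L(Y)] = c_1(X)$ and $c_1[V(\hat\beta_L)] = c_1(X'X)$, whereas $c_1[\hat\beta_{S_\lambda}(Y)]
= c_1(X_\lambda)$ and $c_1[V(\hat\beta_{S_\lambda})] = c_1(X_\lambda'X_\lambda)$, as both are OLS in their respective models. Moreover,
both condition numbers, $c_1[\hat\beta_{S_\lambda}(Y)] = [(\xi_1^2+\lambda)/(\xi_k^2+\lambda)]^{1/2}$ and its square $c_1[V(\hat\beta_{S_\lambda})]$,
decrease monotonically with increasing $\lambda$, thus assuring improved conditioning for the
surrogate estimators. Condition numbers associated with $\hat\theta_{R_\lambda}$, and thus with $\hat\beta_{R_\lambda}$, are more
convoluted and will be examined further in Section 5.

Table 3. Condition numbers for data transformations $\hat\theta(U)$, for parameter
transformations $E(\hat\theta) = T(\theta)$, and for $V(\hat\theta)$, for each of $\{\hat\theta_L, \hat\theta_{R_\lambda}, \hat\theta_{S_\lambda}\}$.

Estimator                   $c_1[\hat\theta(U)]$                                               $c_1[T(\theta)]$                                                 $c_1[V(\hat\theta)]$
$\hat\theta_L$              $\xi_1/\xi_k$                                                      1.00                                                             $\xi_1^2/\xi_k^2$
$\hat\theta_{R_\lambda}$    $\max\{\xi_i/(\xi_i^2+\lambda)\}/\min\{\xi_i/(\xi_i^2+\lambda)\}$  $\xi_1^2(\xi_k^2+\lambda)/[\xi_k^2(\xi_1^2+\lambda)]$            $\max\{\xi_i^2/(\xi_i^2+\lambda)^2\}/\min\{\xi_i^2/(\xi_i^2+\lambda)^2\}$
$\hat\theta_{S_\lambda}$    $[(\xi_1^2+\lambda)/(\xi_k^2+\lambda)]^{1/2}$                      $\{\xi_1^2(\xi_k^2+\lambda)/[\xi_k^2(\xi_1^2+\lambda)]\}^{1/2}$  $(\xi_1^2+\lambda)/(\xi_k^2+\lambda)$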

5. Case Studies 

5.1. The Data. We reexamine the Hospital Manpower Data as reported in Myers (1990).
Records at n = 17 U.S. Naval Hospitals include: $Y$: Monthly man-hours; $X_1$: Average
daily patient load; $X_2$: Monthly X-ray exposures; $X_3$: Monthly occupied bed days; $X_4$:
Eligible population in the area / 1000; and $X_5$: Average length of patients' stay in days.
The basic model is

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \beta_5 X_{i5} + \epsilon_i; \quad 1 \le i \le n.$ (5.1)

Following Hoerl and Kennard (1970), Marquardt (1970), Marquardt and Snee (1975), Myers
(1990), and others, we center and scale the model, so that $Y = Z\beta + \epsilon$ with $Z'Z$ in
correlation form, the central focus being the rates of change $\beta = [\beta_1, \beta_2, \beta_3, \beta_4, \beta_5]'$. The data are
given in Table 3.8, pp. 132-133, of Myers (1990), and computations were done mostly using
PROC IML of the SAS Programming System. The data are exceedingly ill-conditioned:
Elements of $D_\xi$ are $D_\xi = Diag(2.048687, 0.816997, 0.307625, 0.201771, 0.007347)$; $c_1(Z'Z)$
= 77,754.86; the maximal VIF in OLS estimation is $V_M = VIF(\hat\beta_1) = 9,595.685$; and other
VIFs appear subsequently in Table 8 at $\lambda = 0$.

5.2. Choices for $\lambda$. Widely diverse criteria have evolved in the choice for $\lambda$, with profound
consequences regarding ridge estimators, ridge predictors, and their properties. Five criteria
in common usage are reported in Table 4, together with definitions and their values as
determined for the Hospital Manpower Data. These include $DF_\lambda = tr(H_\lambda)$ with $H_\lambda =
[Z(Z'Z + \lambda I_k)^{-1}Z']$; the cross-validation $PRESS_\lambda$ statistic of Allen (1974); a rotation-invariant
version called Generalized Cross Validation ($GCV_\lambda$) by Golub et al. (1979); $C_\lambda$
as a device for variance-bias trade-off as in Mallows (1973); and $HKB_\lambda$ as recommended
by Hoerl, Kennard and Baldwin (1975) based on simulation studies. As listed in Table 4,
$SS_{Res,\lambda}$ is the residual sum of squares using ridge regression; $\hat\sigma^2$ is the OLS residual mean
square; and $\{e_{(i,\lambda)}\}$ are the PRESS residuals for ridge regression. Further details are given in
Myers (1990), pp. 392-411, including numerical values for $DF_\lambda$, $C_\lambda$, and $PRESS_\lambda$ as reported
in Table 4. Further choices include $\lambda \in \{0.01, 0.03, 0.05, 0.07, 0.09\}$ and others to be noted
subsequently.

Table 4. Choices for $\lambda$ in the Hospital Manpower Data corresponding to
conventional criteria $DF_\lambda$, $GCV_\lambda$, $C_\lambda$, $PRESS_\lambda$, and $HKB_\lambda$.

Name             Definition                                                    Value for $\lambda$
$DF_\lambda$     $tr(H_\lambda) = \sum_{i=1}^{k} \xi_i^2/(\xi_i^2+\lambda)$    0.0004
$GCV_\lambda$    $SS_{Res,\lambda}/[n - (1 + tr(H_\lambda))]^2$                0.004787
$C_\lambda$      $SS_{Res,\lambda}/\hat\sigma^2 - n + 2 + 2\,tr(H_\lambda)$    0.0050
$PRESS_\lambda$  $\sum_{i=1}^{n} e_{(i,\lambda)}^2$                            0.2300
$HKB_\lambda$    $k\hat\sigma^2/\hat\beta_L'\hat\beta_L$                       0.616964
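
As an illustration of the first criterion, the trace $tr(H_\lambda)$ can be evaluated directly from the singular values quoted in Section 5.1, since the eigenvalues of $H_\lambda$ are $\xi_i^2/(\xi_i^2+\lambda)$. The sketch below is in Python with NumPy and is not the SAS code used for the paper; the function name `df_ridge` is hypothetical, and the $\lambda$ values are those listed in Table 4.

```python
import numpy as np

# Singular values of Z for the Hospital Manpower Data, as quoted in Section 5.1.
XI = np.array([2.048687, 0.816997, 0.307625, 0.201771, 0.007347])

def df_ridge(lam, xi=XI):
    """tr(H_lambda) = sum_i xi_i^2 / (xi_i^2 + lambda)."""
    xi2 = xi ** 2
    return float(np.sum(xi2 / (xi2 + lam)))

# Evaluate at lambda = 0 and at the values listed in Table 4.
for lam in (0.0, 0.0004, 0.004787, 0.0050, 0.2300, 0.616964):
    print(f"lambda = {lam:<9} tr(H_lambda) = {df_ridge(lam):.4f}")
```

At $\lambda = 0$ the trace equals $k = 5$, and it decreases as $\lambda$ grows; how a particular value of the trace is mapped to a choice of $\lambda$ follows Myers (1990) and is not reproduced here.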



5.3. Minimizing Solutions. Often a definitive value for the constraint $\{\beta'\beta = c^2\}$ is not
apparent in a particular study. This motivates the dual Assertions A3 and A5 of Section
3.2: (i) choose $\lambda$; (ii) solve (3.3) for $\hat\beta_{R_\lambda}$; (iii) evaluate the implied constraint at (3.4) as
$\hat\beta_{R_\lambda}'\hat\beta_{R_\lambda} = c^{*2}$; and (iv) assert as in A5 that the solution so attained "minimizes the sum of
squares of residuals on the sphere centered at the origin whose radius is the length" of $\hat\beta_{R_\lambda}$.
We have claimed that Assertion A5 is false. Evidence is provided in Table 5, where lengths
$\|\hat\beta_{R_\lambda}\|$, and square roots $R(\lambda) = [(Y - Z\hat\beta_{R_\lambda})'(Y - Z\hat\beta_{R_\lambda})]^{1/2}$, are reported as $\lambda$ ranges
systematically over [0, 1]. Recall that this range is stipulated by Hoerl and Kennard (1970)
and others when $Z'Z$ is in "correlation form." Here $\hat\beta_{R_\lambda} = [\hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4, \hat\beta_5]'$ consists of
rates of change; similar trends are exhibited when $\beta$ is expanded to include the intercept.
It is seen that $\|\hat\beta_{R_\lambda}\|$ initially decreases to a minimum, then increases beyond $\lambda = 1.0$, but
eventually decreases to zero since $\hat\beta_{R_\lambda}$ is a shrinkage estimator.

Table 5. Lengths of $\hat\beta_{R_\lambda}$, and square roots of residual sums of squares
$R(\lambda) = [(Y - Z\hat\beta_{R_\lambda})'(Y - Z\hat\beta_{R_\lambda})]^{1/2}$, for designated values of $\lambda$.

$\lambda$                    0.00     0.04     0.08     0.12     0.16     0.20     0.24     0.28
$\|\hat\beta_{R_\lambda}\|$  394.67   137.82   33.14    31.50    70.02    99.19    122.10   140.73
$R(\lambda)$                 2129.53  2474.87  2735.75  2914.38  3057.54  3184.84  3305.00  3422.13

$\lambda$                    0.32     0.36     0.40     0.48     0.56     0.60     0.64     0.68
$\|\hat\beta_{R_\lambda}\|$  156.25   169.40   180.69   199.00   213.09   218.93   224.11   228.70
$R(\lambda)$                 3538.22  3654.22  3770.58  4004.70  4240.27  4358.28  4470.26  4594.06

$\lambda$                    0.72     0.76     0.80     0.84     0.88     0.92     0.96     1.00
$\|\hat\beta_{R_\lambda}\|$  232.79   236.42   239.65   242.53   245.08   247.34   249.33   251.09
$R(\lambda)$                 4711.56  4828.65  4945.23  5061.20  5176.49  5291.03  5404.77  5517.64

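Greater detail is seen on recalling from Section 4.4 that $\hat\beta_{R_\lambda} = Q\hat\theta_{R_\lambda}$; that $Q$ is orthogonal;
and thus, letting $g_{\beta_R}(\lambda) = \|\hat\beta_{R_\lambda}\|^2$ and $g_{\theta_R}(\lambda) = \|\hat\theta_{R_\lambda}\|^2$, that $g_{\beta_R}(\lambda) = g_{\theta_R}(\lambda)$. The canonical
form of Section 4.4 assures that $g_{\theta_R}(\lambda) = \sum_{i=1}^{k} \xi_i^2 U_i^2/(\xi_i^2 + \lambda)^2$. This is differentiable; its
derivative is

$\partial g_{\theta_R}(\lambda)/\partial\lambda = -2 \sum_{i=1}^{k} U_i^2 \xi_i^2 (\xi_i^2 + \lambda)^{-3};$ (5.2)

and its path traces evolution of the derivative as $\lambda$ varies. In particular, at $\lambda = 0$ we have
$[\partial g_{\theta_R}(\lambda)/\partial\lambda]_{\lambda=0} = -2 \sum_{i=1}^{k} U_i^2/\xi_i^4$. This is precipitous for the Hospital Manpower Data in
view of the fact that $\xi_k = \xi_5 = 0.007347$.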
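
The derivative in (5.2) can be checked against a finite difference. The following Python sketch is illustrative only: the vectors `xi` and `u` are hypothetical values, not the Hospital Manpower quantities, and the function names `g` and `dg` are introduced here for the squared length and its derivative.

```python
import numpy as np

def g(lam, xi, u):
    """Squared length of the canonical ridge solution: sum xi^2 u^2 / (xi^2 + lam)^2."""
    return float(np.sum((xi * u) ** 2 / (xi ** 2 + lam) ** 2))

def dg(lam, xi, u):
    """Analytic derivative as in (5.2): -2 sum u^2 xi^2 (xi^2 + lam)^(-3)."""
    return float(-2.0 * np.sum((xi * u) ** 2 / (xi ** 2 + lam) ** 3))

xi = np.array([2.0, 0.8, 0.3, 0.2, 0.05])   # hypothetical singular values
u = np.array([1.0, -0.5, 0.7, 0.2, -0.3])   # hypothetical elements of U

lam, h = 0.1, 1e-6
central = (g(lam + h, xi, u) - g(lam - h, xi, u)) / (2.0 * h)
print(dg(lam, xi, u), central)               # analytic vs. central difference

# At lam = 0 the derivative reduces to -2 sum u_i^2 / xi_i^4.
print(dg(0.0, xi, u), float(-2.0 * np.sum(u ** 2 / xi ** 4)))
```

With a smallest singular value as small as the one quoted for the Hospital Manpower Data, the term $U_k^2/\xi_k^4$ dominates the slope at the origin, which is the sense in which the text calls it precipitous.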

A detailed local view is provided in Table 6, to include not only $\|\hat\beta_{R_\lambda}\|$ and $R(\lambda)$, but
also the ridge estimates $\hat\beta_{R_\lambda} = [\hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4, \hat\beta_5]'$ in rows corresponding to various choices
for $\lambda$. Values of $\hat\beta_{R_\lambda}$ for $\lambda \in \{0.08, 0.11, 0.12\}$ are as in Table 8.9 of Myers (1990), who
reports ridge estimates for $\lambda \in [0, 0.24]$ by increments of 0.01.

Table 6. Ridge estimators $\hat\beta_{R_\lambda}$, lengths of $\hat\beta_{R_\lambda}$, and square roots $R(\lambda) =
[(Y - Z\hat\beta_{R_\lambda})'(Y - Z\hat\beta_{R_\lambda})]^{1/2}$ of residual sums of squares, for designated
values of $\lambda$.

$\lambda$ $\hat\beta_1$  $\hat\beta_2$  $\hat\beta_3$  $\hat\beta_4$  $\hat\beta_5$  $\|\hat\beta_{R_\lambda}\|$  $R(\lambda)$
0.08      10.6354        0.065428       0.359139       6.3206         -30.7471       33.1448                      2735.75
0.08095   10.6118        0.065432       0.358279       6.3674         -28.9649       31.5000                      2740.68
0.08797   10.4475        0.065444       0.352298       6.6903         -16.4728       20.6250                      2775.83
0.0981    10.2378        0.065414       0.344681       7.0942         -0.3156        12.4645                      2823.03
0.09829   10.2342        0.065413       0.344548       7.1012         -0.0308        12.4615                      2823.89
0.0983    10.2340        0.065413       0.344541       7.1015         -0.0159        12.4615                      2823.93
0.11      10.0248        0.065325       0.336955       7.4935         16.3900        20.6251                      2874.22
0.12      9.8679         0.065217       0.331280       7.7785         28.8834        31.5000                      2914.38

It is seen that $\|\hat\beta_{R_\lambda}\|$ takes its minimum value, 12.46150, at $\lambda_{min} = 0.09829$. To continue,
designate $\hat\beta_{R_\lambda}$ as $\hat\beta_R(\lambda)$. It is seen that $\hat\beta_R(0.12)$ and $\hat\beta_R(0.08095)$ have the same length,
namely, $\|\hat\beta_R(0.12)\| = 31.500 = \|\hat\beta_R(0.08095)\|$, so that $\Lambda(31.500) = \{0.08095, 0.12\}$ in the
notation of (3.6). Suppose a user chooses $\hat\beta_R(0.12)$ as the ridge estimate for the Hospital
Manpower Data. Then $\hat\beta_R(0.12)$ is not the minimizing solution of length 31.500; this is seen
from $R(0.12) = 2914.38 > 2740.68 = R(0.08095)$. Similarly, it is clear that $\Lambda(20.625) =
\{0.08797, 0.11\}$ as in (3.6), and that $\hat\beta_R(0.11)$ of Table 6 is not minimizing, to be supplanted
instead by $\hat\beta_R(0.08797)$ from Table 6. A continuum of further examples can be constructed by
reflecting $\lambda$ asymmetrically about $\lambda_{min} = 0.09829$, the smaller $\lambda$ of each pair corresponding to
the minimizing solution. These clearly constitute counterexamples to Assertion A5.
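
The comparison behind these counterexamples is a simple selection: among tabulated solutions of equal length, retain the one with the smallest $R(\lambda)$. A minimal Python sketch follows, using only the $\lambda$, length, and $R(\lambda)$ columns transcribed from Table 6; the function names and the matching tolerance `tol` are hypothetical conveniences, the tolerance allowing for rounding in the tabulated lengths.

```python
# (lambda, length of the ridge solution, R(lambda)), transcribed from Table 6.
TABLE6 = [
    (0.08,    33.1448, 2735.75),
    (0.08095, 31.5000, 2740.68),
    (0.08797, 20.6250, 2775.83),
    (0.0981,  12.4645, 2823.03),
    (0.09829, 12.4615, 2823.89),
    (0.0983,  12.4615, 2823.93),
    (0.11,    20.6251, 2874.22),
    (0.12,    31.5000, 2914.38),
]

def equal_length_set(rows, c, tol=1e-3):
    """Tabulated (lambda, R) pairs whose solution length equals c to within tol."""
    return [(lam, resid) for lam, length, resid in rows if abs(length - c) <= tol]

def minimizing_lambda(rows, c, tol=1e-3):
    """Among tabulated solutions of length c, the lambda with the smallest R(lambda)."""
    candidates = equal_length_set(rows, c, tol)
    return min(candidates, key=lambda pair: pair[1])[0]

for c in (31.5, 20.625):
    lams = [lam for lam, _ in equal_length_set(TABLE6, c)]
    print(c, lams, minimizing_lambda(TABLE6, c))
```

For both lengths examined in the text, the retained value is the smaller member of the pair.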

Not only are definitive values for the constraint $\{\beta'\beta = c^2\}$ not evident beforehand, but
profound and heretofore undiscovered limits pertain to admissible values for $\lambda$ in order that
solutions of given length $c^*$ be minimizing. To fix ideas, suppose in equation (3.4) that
$\{g_{\beta_R}(0.00) \ge c^{*2} \ge g_{\beta_R}(0.09829) = 12.46150^2 = 155.2877\}$. Then the only feasible
values for $\lambda$ are those in the interval $[\min g_{\beta_R}^{-1}(c^{*2}), 0.09829]$. For example, if $33.14481^2 =
1098.5784 \ge c^{*2} \ge 155.2877$, then from Table 6 the feasible values are $\lambda \in [0.08, 0.09829]$.
For $\{g_{\beta_R}(0.00) \ge c^{*2} \ge g_{\beta_R}(0.09829)\}$, the feasible values are $\lambda \in [0.00, 0.09829]$. These are
the only feasible values for $\lambda \in [0, 1]$. On the other hand, choosing $\{0 < c^{*2} < g_{\beta_R}(0.09829) =
155.2877\}$ requires $\lambda$ in the interval $(\min g_{\beta_R}^{-1}(c^{*2}), \infty)$, where $\min g_{\beta_R}^{-1}(155.2877) > 158$. For
example, if $c^{*2} < 100$, then the feasible values are $\lambda \in (198, \infty)$. As these are far outside
the recommended interval [0, 1], constraints $c^{*2} \in (0, 155.2877)$ must be declared to be
inadmissible. Values reported for $\{0 < c^{*2} < g_{\beta_R}(0.09829) = 155.2877\}$ are supported by
the Maple software package. Values reported for $PRESS_\lambda$ and $HKB_\lambda$ in Table 4 are thus
inadmissible in view of Assertion A5*.

In short, imbedded in the Hospital Manpower Data are the hidden feasible constraints
$\{\beta'\beta = c^2\}$ with $c^2 \ge 155.2877$. These could not have been discerned beforehand short of
the foregoing detailed analyses.

To summarize, origins of the anomaly exhibited here may be traced as follows: (i)
The ridge trace of $\hat\beta_5(\lambda)$ exhibits a down-up-down character, beginning with $\hat\beta_5(0.00) =
-394.3280$, decreasing to zero between $\lambda = 0.09$ and $\lambda = 0.10$, and increasing thereafter to
$\hat\beta_5(1.00) = 250.8307$ and beyond, and eventually decreasing to zero through shrinkage. (ii)
$|\hat\beta_5(\lambda)|$ dominates other estimates by orders of magnitude ranging from one to four except
near its minimum. (iii) Other estimates exhibit relatively narrow ranges in comparison with
$\hat\beta_5(\lambda)$ as $\lambda$ varies over [0, 1]. (iv) In consequence, $\|\hat\beta_{R_\lambda}\|^2$ is largely determined by $[\hat\beta_5(\lambda)]^2$ as
$\lambda$ varies. Finally note that $\Lambda(c^*)$ from (3.6) takes on two values in the cases examined, from
the down-up-down character of $\|\hat\beta_{R_\lambda}\|$ as $\lambda$ evolves. It is clear in other circumstances that
$\Lambda(c^*)$ may consist of three or more elements. For example, a single dominant estimate may
exhibit multiple sign changes, whereas estimates for other coefficients may have one or more
sign changes as well. These and related matters are studied in Zhang and McDonald (2005),
and references cited therein, under special structure of $Z'Z$ in correlation form. Properties,
to include sign changes, crossings, and rates-of-change of individual ridge estimates, as well
as bounds on the number of sign changes, are determined by those authors on identifying
zeros and derivatives of polynomials in $\lambda$ of degree $k - 1$, under special structure as cited.

These facts alone challenge the meaning of numerous simulation studies purporting to 
compare alternative criteria for choosing A, when all such choices have ignored the minimiz- 
ing constraints on A. Thus aggregates of minimizing/non-minimizing values are compared 
with other such aggregates, to the effect of total obfuscation. 

We turn next to properties of ridge and surrogate ridge solutions, to include condition
numbers and other diagnostics. Computations for the condition numbers proceed as in
Table 3, based on equivalence between conditioning for $\hat\beta$ and the canonical estimators $\hat\theta$,
as noted in Section 4.6.

5.4. Properties of $\hat\beta_{R_\lambda}$ and $\hat\beta_{S_\lambda}$. In summary, the ridge solutions to $(Z'Z + \lambda I_k)\beta = Z'Y$
account for ill-conditioning of $Z'Z$ on the left of $Z'Z\beta = Z'Y$, whereas the surrogate
solutions to $Z_\lambda'Z_\lambda\beta = Z_\lambda'Y$ account for ill-conditioning on the right as well. It thus is germane
to compare $\{\hat\beta_{S_\lambda}; \lambda \ge 0\}$ with $\{\hat\beta_{R_\lambda}; \lambda \ge 0\}$ using the data at hand. We next examine critical
issues from Section 4.5, applicable both to ridge and to surrogate ridge solutions. Table 7
lists condition numbers and other quantities affiliated with $\{\hat\beta_{R_\lambda}; \lambda \ge 0\}$ and $\{\hat\beta_{S_\lambda}; \lambda \ge 0\}$,
under values for $\lambda$ as listed.

Table 7. Condition numbers for $\hat\beta_{R_\lambda}(Y)$, $\hat\beta_{S_\lambda}(Y)$, $V(\hat\beta_{R_\lambda})$, and $V(\hat\beta_{S_\lambda})$; the
maximal VIFs $V_M(\hat\beta_{R_\lambda})$ and $V_M(\hat\beta_{S_\lambda})$; and the Frobenius distance $D_Z(Z_\lambda)
= \|Z - Z_\lambda\|_F$, under various choices for $\lambda$.

$\lambda$ $c_1[\hat\beta_{R_\lambda}(Y)]$  $c_1[\hat\beta_{S_\lambda}(Y)]$  $V_M(\hat\beta_{R_\lambda})$  $c_1[V(\hat\beta_{R_\lambda})]$  $V_M(\hat\beta_{S_\lambda})$  $c_1[V(\hat\beta_{S_\lambda})]$  $D_Z(Z_\lambda)$
0.0004    33.1584                          96.1565                          141.5345                      1099.4770                        1146.399                      9246.064                         0.0140
0.004787  9.0957                           29.4630                          10.9688                       82.7319                          112.6300                      868.0653                         0.0638
0.005     9.0537                           28.8348                          10.8874                       81.9695                          108.0918                      831.4473                         0.0654
0.010     8.1707                           20.4561                          9.2481                        66.7610                          56.6915                       418.4530                         0.0974
0.030     11.6724                          11.8596                          21.2905                       136.2440                         21.2197                       140.6508                         0.1847
0.050     15.1539                          9.2114                           34.0995                       229.6392                         13.6552                       84.8507                          0.2511
0.070     17.8166                          7.8046                           42.5990                       317.4320                         10.2639                       60.9119                          0.3083
0.090     20.4222                          6.8997                           51.7827                       417.0673                         8.3166                        47.6061                          0.3598
0.230     29.6720                          4.3868                           100.5675                      880.4276                         3.9338                        19.2438                          0.6429
0.616964  53.4183                          2.7932                           250.4309                      2853.5130                        2.0374                        7.8022                           1.1769
1.000     66.6915                          2.2797                           451.5788                      4447.7550                        1.5976                        5.1968                           1.5745

Question 1 of Section 4.5 is negated for $\hat\beta_{R_\lambda}$: Stability of the solutions $\hat\beta_{R_\lambda}$, as gauged
by $c_1[\hat\beta_{R_\lambda}(Y)]$, initially improves but then erodes. Further computations show that
$c_1[\hat\beta_{R_\lambda}(Y)]$ takes its minimal value, 7.4463, at $\lambda = 0.015$, and increases thereafter. In
contrast, despite higher beginning values than $c_1[\hat\beta_{R_\lambda}(Y)]$, the condition numbers
$c_1[\hat\beta_{S_\lambda}(Y)]$ for surrogate estimators decrease monotonically with increasing $\lambda$, the trends
$c_1[\hat\beta_{R_\lambda}(Y)] = 11.7723 = c_1[\hat\beta_{S_\lambda}(Y)]$ crossing at $\lambda = 0.03045$.
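
These figures can be retraced from the singular values of Section 5.1 alone. The sketch below is in Python with NumPy rather than the SAS PROC IML used for the paper. It assumes the canonical forms of Section 4.6, namely weights $\xi_i/(\xi_i^2+\lambda)$ for the ridge solution and the ratio $[(\xi_1^2+\lambda)/(\xi_k^2+\lambda)]^{1/2}$ for the surrogate. The grid step of 0.001 and the bisection bracket [0.02, 0.05] are assumptions about how such values might be located, and the function names are hypothetical.

```python
import numpy as np

# Singular values of Z for the Hospital Manpower Data, as quoted in Section 5.1.
XI = np.array([2.048687, 0.816997, 0.307625, 0.201771, 0.007347])

def c1_ridge(lam, xi=XI):
    """Ratio of largest to smallest ridge weight xi_i / (xi_i^2 + lam)."""
    w = xi / (xi ** 2 + lam)
    return float(w.max() / w.min())

def c1_surrogate(lam, xi=XI):
    """[(xi_1^2 + lam) / (xi_k^2 + lam)]^(1/2) for the surrogate solution."""
    return float(np.sqrt((xi.max() ** 2 + lam) / (xi.min() ** 2 + lam)))

# Locate the smallest ridge condition number on an assumed grid of step 0.001.
grid = np.linspace(0.0, 0.1, 101)
ridge_on_grid = np.array([c1_ridge(lam) for lam in grid])
lam_min = float(grid[ridge_on_grid.argmin()])
c1_min = float(ridge_on_grid.min())

def crossing(lo=0.02, hi=0.05, tol=1e-10):
    """Bisection for the lambda at which the ridge and surrogate values coincide."""
    gap = lambda lam: c1_ridge(lam) - c1_surrogate(lam)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid   # ridge value still below surrogate value
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_cross = crossing()
print(lam_min, c1_min)
print(lam_cross, c1_ridge(lam_cross), c1_surrogate(lam_cross))
```

Because the quoted singular values are rounded, agreement with Table 7 should be expected only to about four significant figures.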

Questions 2 and 3 of Section 4.5 are refuted for $\hat\beta_{R_\lambda}$: Computations interpolating those of
Table 7 show that $c_1[V(\hat\beta_{R_\lambda})]$ temporarily decreases over $\lambda \in [0, 0.015]$, where its minimum
is 55.4470, but it increases thereafter. Similarly, the maximal VIFs for $\hat\beta_{R_\lambda}$ initially decrease
and then increase. By comparison, both the condition numbers $c_1[V(\hat\beta_{S_\lambda})]$, and the maximal
VIFs for $\hat\beta_{S_\lambda}$, decrease with increasing $\lambda$. Although initially larger, $V_M(\hat\beta_{S_\lambda})$ approximates
$V_M(\hat\beta_{R_\lambda})$ at $\lambda = 0.030$, and the ratio $V_M(\hat\beta_{R_\lambda})/V_M(\hat\beta_{S_\lambda})$ increases markedly thereafter.

Recall that the surrogate $Y = Z_\lambda\beta + \epsilon$ is intended as an approximation to $Y = Z\beta + \epsilon$.
The order of approximation, as gauged by the Frobenius distance (4.3), is tabulated as
the final column of Table 7. Relative changes, given by $\|Z - Z_\lambda\|_F/\|Z\|_F$, are 0.1123
at $\lambda = 0.05$, ranging up to 0.5263 at $\lambda = 0.616964$, where the denominator is $\|Z\|_F =
2.236068$.
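
The distances in the final column of Table 7 can be approximated from the singular values alone. The Python sketch below rests on one assumption that should be checked against Section 4: that $Z_\lambda$ shares its singular vectors with $Z$ and has singular values $(\xi_i^2+\lambda)^{1/2}$, as suggested by the condition numbers $c_1(X_\lambda)$ quoted in Section 4.6. Under that assumption the Frobenius distance reduces to a sum over singular values. The function name is hypothetical.

```python
import numpy as np

# Singular values of Z for the Hospital Manpower Data, as quoted in Section 5.1.
XI = np.array([2.048687, 0.816997, 0.307625, 0.201771, 0.007347])

def frobenius_distance(lam, xi=XI):
    """||Z - Z_lam||_F, assuming Z_lam has singular values sqrt(xi_i^2 + lam)
    and the same singular vectors as Z."""
    return float(np.sqrt(np.sum((np.sqrt(xi ** 2 + lam) - xi) ** 2)))

# Frobenius norm of Z itself; close to sqrt(5) since Z'Z is in correlation form.
norm_z = float(np.sqrt(np.sum(XI ** 2)))

for lam in (0.0004, 0.05, 0.616964):
    d = frobenius_distance(lam)
    print(f"lambda = {lam:<9} distance = {d:.4f} relative = {d / norm_z:.4f}")
```

Under the stated assumption the output agrees with the tabulated distances and relative changes to the four decimals reported.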

Further details are given in Tables 8 and 9, from which several entries of Table 7 are
drawn. Table 8 examines the evolution of VIFs, and conditioning of the correlation
matrices, for $\hat\beta_{R_\lambda}$ as $\lambda$ varies. Values for $c_1[C(\hat\beta_{R_\lambda})]$ are included, as Lemma 1 applies in
each case. It is found that $c_1[C(\hat\beta_{R_\lambda})]$ achieves its minimum, 61.4449, at $\lambda = 0.0173$.

Table 8. Variance inflation factors for $\hat\beta_{R_\lambda}$, and condition numbers for
$C(\hat\beta_{R_\lambda})$ and $T(\beta) = E(\hat\beta_{R_\lambda})$, for designated values of $\lambda$.

$\lambda$ VIF1       VIF2       VIF3       VIF4       VIF5       $c_1[C(\hat\beta_{R_\lambda})]$  $c_1[T(\beta)]$
0.000     9595.68    7.9406     8931.449   23.2887    4.2794     54756.83                         1.0000
0.0004    141.5345   7.8481     133.0221   13.0512    3.3997     576.8409                         8.4095
0.004787  7.1604     7.1682     7.8840     10.9688    3.0128     90.13222                         89.5726
0.005     7.1047     7.1379     7.8349     10.8874    2.9972     89.50392                         93.5175
0.010     8.0001     6.4919     8.8456     9.2481     2.6830     75.66936                         185.8150
0.030     19.7743    4.7268     21.2905    5.6339     2.0003     109.4703                         552.8219
0.050     32.0013    3.6885     34.0995    4.0168     1.6988     177.2545                         916.3722
0.070     42.5990    3.0187     45.0695    3.1473     1.5363     227.9178                         1276.515
0.090     51.7827    2.5598     54.4589    2.6269     1.4377     267.3171                         1633.297
0.230     100.5675   1.3868     102.4723   1.6364     1.2446     441.9639                         4040.511
0.616964  250.4309   1.0879     243.2535   1.8791     1.3541     1047.931                         9965.795
1.000     451.5788   1.2184     430.5957   2.4738     1.5386     2174.418                         14961.96

Table 9. Variance inflation factors for $\hat\beta_{S_\lambda}$, and condition numbers for
$C(\hat\beta_{S_\lambda})$ and $V(\hat\beta_{S_\lambda})$, for designated values of $\lambda$.

$\lambda$ VIF1       VIF2       VIF3       VIF4       VIF5       $c_1[C(\hat\beta_{S_\lambda})]$
0.000     9595.68    7.9406     8931.449   23.2887    4.2794     54756.83
0.0004    1146.399   7.8846     1068.211   14.2203    3.5089     5091.248
0.004787  112.6300   7.5308     106.0234   12.0987    3.2190     458.9380
0.005     108.0918   7.5147     101.7946   12.0488    3.2099     440.5738
0.010     56.6915    7.1607     53.8459    11.0379    3.0219     233.7461
0.030     21.2197    6.0737     20.5412    8.4506     2.5374     93.1862
0.050     13.6552    5.3181     13.3380    6.9511     2.2606     63.4167
0.070     10.2639    4.7584     10.0777    5.9605     2.0792     48.8365
0.090     8.3166     4.3258     8.1934     5.2538     1.9500     39.9124
0.230     3.9338     2.8218     3.9092     3.1133     1.5493     17.9102
0.616964  2.0374     1.7669     2.0330     1.8380     1.2710     7.5371
1.000     1.5976     1.4635     1.5957     1.4981     1.1781     5.0614

In all instances each VIF initially decreases, then increases, but values of $\lambda$ at which the
changes occur differ across the five estimators. If we view $E(\hat\beta_{R_\lambda}) = T(\beta)$ as a transformation
on the parameter space, Question 4 of Section 4.5 asks whether its conditioning improves with
increasing $\lambda$. To the contrary, the last column of Table 8 shows that condition numbers
increase explosively with increasing $\lambda$. From Table 3 it is clear that corresponding condition
numbers for $E(\hat\beta_{S_\lambda}) = T(\beta)$ are square roots of those listed in Table 8 for $\hat\beta_{R_\lambda}$.

Similar entries in Table 9 give the evolution of VIFs and $c_1[C(\hat\beta_{S_\lambda})]$ for $\hat\beta_{S_\lambda}$.
A noted departure from Table 8 is that the maximal VIF is $V_M(\hat\beta_{S_\lambda}) = VIF(\hat\beta_1)$ for
all cases, independently of $\lambda$. Further computations show that the crossing $c_1[C(\hat\beta_{R_\lambda})] =
99.56217 = c_1[C(\hat\beta_{S_\lambda})]$ occurs at $\lambda = 0.02750$.
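
The conditioning of $T(\beta)$ discussed under Question 4 can likewise be retraced from the singular values of Section 5.1. The Python sketch below assumes canonical shrinkage factors $\xi_i^2/(\xi_i^2+\lambda)$ for the ridge mean and $\xi_i/(\xi_i^2+\lambda)^{1/2}$ for the surrogate mean; both forms are inferred here from the entries of Table 8 and the square-root relation stated in the text, and the function names are hypothetical.

```python
import numpy as np

# Singular values of Z for the Hospital Manpower Data, as quoted in Section 5.1.
XI = np.array([2.048687, 0.816997, 0.307625, 0.201771, 0.007347])

def c1_T_ridge(lam, xi=XI):
    """Conditioning of E(ridge) = T(beta): extreme ratio of xi^2 / (xi^2 + lam)."""
    w = xi ** 2 / (xi ** 2 + lam)
    return float(w.max() / w.min())

def c1_T_surrogate(lam, xi=XI):
    """Conditioning of E(surrogate) = T(beta): extreme ratio of xi / sqrt(xi^2 + lam)."""
    w = xi / np.sqrt(xi ** 2 + lam)
    return float(w.max() / w.min())

for lam in (0.0, 0.0004, 0.05, 1.0):
    print(lam, c1_T_ridge(lam), c1_T_surrogate(lam))
```

Under these assumptions the surrogate values are the square roots of the ridge values at every $\lambda$, so the growth of the surrogate conditioning, while still monotone, is correspondingly slower.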

6. Conclusions 

Little of the considerable literature on ridge regression is found to be consistent with the
optimization of Hoerl and Kennard (1970) under equality constraints $\{\beta'\beta = c^2\}$, and under
the inequality constraints $\{\beta'\beta \le c^2\}$ of Balakrishnan (1963), despite pervasive claims to
the contrary.

The problem is traced to (i) a misapplication of LaGrange's principle; (ii) the false claim
that the constrained solutions have nonsingular distributions, corresponding one-to-one
with $\hat\beta_L$; and (iii) the implied but incorrect assertion that the ridge parameter $\lambda$ corresponds
one-to-one with $c^2$, and thus the false claim that the solution $\hat\beta_{R_\lambda}$ of $(X'X + \lambda I_k)\beta = X'Y$
minimizes the residual sum of squares among estimators of length $\hat\beta_{R_\lambda}'\hat\beta_{R_\lambda} = c^{*2}$. Our
Theorem 1 supplies the missing distributions appropriate to constrained minimization.
Generalized ridge regression, seen as solving the equations $(X'X + \Lambda)\beta = X'Y$ with nonnegative
ridge parameters $\Lambda = Diag(\lambda_1, \ldots, \lambda_k)$, is also shown to be inconsistent with LaGrange
minimization.

LaGrange optimization having failed as a rational foundation for conventional ridge
regression, alternatives based on conditioning are developed in Section 4. Limitations in Type
A conditioning, on which a justification for $\hat\beta_{R_\lambda}$ rests, prompt the introduction of surrogate
ridge solutions, $\hat\beta_{S_\lambda}$, to account for ill-conditioning of $X$ on both sides of the OLS
equations, $X'X\beta = X'Y$. Extensive numerical studies, as reported in Section 5, reexamine
the Hospital Manpower Data in a manner complementary to the conventional analyses
undertaken in Myers (1990). It is demonstrated that none of the conditionings of $\hat\beta_{R_\lambda}(Y)$,
$E(\hat\beta_{R_\lambda}) = T(\beta)$, and $V(\hat\beta_{R_\lambda})$, nor the variance inflation factors, as critical properties of the
ridge estimators $\{\hat\beta_{R_\lambda}; \lambda \ge 0\}$, is enhanced monotonically on increasing $\lambda$. In contrast, for
the surrogate solutions $\hat\beta_{S_\lambda}$, all (except $T(\beta)$) of these are uniformly enhanced as $\lambda$ evolves.
It is seen that $\hat\beta_{R_\lambda}$ is better within a narrow range for small $\lambda$, but its VIFs and condition
numbers often become excessive within the range of $\lambda$ often recommended in practice. In
short, ridge regression often exhibits some of the very pathologies it is intended to redress.

In summary, there is a vast and expanding compendium on the so-called theory, method- 
ology, and simulation studies surrounding ridge regression. If indeed constrained optimiza- 
tion is to be pivotal, then the bulk of these studies will have to be reworked to take into 
account the nonstandard distributions of Section 3.3, as well as constraints for the ridge 
parameter to be minimizing, as documented in Sections 3.4 and 5.3. It is remarkable that 
this field of applied engineering has thrived for so long, despite critical false assertions and 
a dearth of sustaining foundation principles.